Deep Learning


Single Layer Neural Network



Example: MNIST Digits

  • Handwritten digits: 28 × 28 grayscale images; 60K train, 10K test images

  • Features are the 784 pixel grayscale values \(\in [0, 255]\)

  • Labels are the digit class 0–9


  • Goal: build a classifier to predict the image class

  • We build a two-layer network with 256 units at first layer, 128 units at second layer, and 10 units at output layer

  • Along with intercepts (called biases) there are 235,146 parameters (referred to as weights)


  • Details of output layer

    • Let \(Z_m = \beta_{m0} + \sum_{\ell = 1}^{K_2} \beta_{m\ell}A_{\ell}^{(2)}\), \(m = 0, 1, \ldots, 9\), be 10 linear combinations of the activations \(A_{\ell}^{(2)}\) at the second layer

    • Output activation function encodes the softmax function

      • \(f_m(X) = \Pr(Y = m \mid X) = \frac{e^{Z_m}}{\sum_{\ell = 0}^{9} e^{Z_\ell}}\)
    • We fit the model by minimizing the negative multinomial log-likelihood (or cross-entropy):

      • \(-\sum_{i=1}^{n} \sum_{m=0}^{9} y_{im} \log(f_m(x_i))\)
    • \(y_{im}\) is 1 if true class for observation \(i\) is \(m\), else 0 (one-hot encoded)

  • Results

    • Early success for neural networks in the 1990s

    • With so many parameters, regularization is essential

    • Some details of regularization and fitting will come later


  • Very overworked problem: best reported error rates are < 0.5%

  • Human error rate is reported to be around 0.2%, or 20 of the 10K test images
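The parameter count and the softmax/cross-entropy formulas above can be checked directly in NumPy; this is a minimal sketch, where the logits `z` are made-up values for illustration (not fitted weights):

```python
import numpy as np

# Parameter count for the 784 -> 256 -> 128 -> 10 network (weights + biases)
layers = [784, 256, 128, 10]
n_params = sum(layers[i] * layers[i + 1] + layers[i + 1] for i in range(3))
print(n_params)  # 235146

def softmax(z):
    # Output activation: f_m(X) = e^{Z_m} / sum_l e^{Z_l}
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def cross_entropy(probs, true_class):
    # One observation's contribution to the negative multinomial log-likelihood
    return -np.log(probs[true_class])

z = np.array([2.0, 0.5, -1.0, 0.1, 0.0, 0.3, -0.2, 1.1, 0.4, -0.5])
f = softmax(z)                 # probabilities over the 10 digit classes
loss = cross_entropy(f, 0)     # loss if the true class is 0
```

The probabilities `f` sum to 1 by construction, and the loss is small exactly when the model puts high probability on the true class.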


Convolutional Neural Network - CNN



Convolution Filter



Pooling



Architecture of a CNN



Using Pretrained Networks to Classify Images



Document Classification


Lasso versus Neural Network – IMDB Reviews


  • The simpler lasso logistic regression model works about as well as the neural network in this case

  • glmnet was used to fit the lasso model, and is very effective because it can exploit sparsity in the \(\mathbf{X}\) matrix
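glmnet is an R package; an analogous L1-penalized (lasso) logistic regression on a sparse matrix can be sketched in Python with scikit-learn, whose solvers also exploit sparsity in \(\mathbf{X}\). The synthetic data below is purely illustrative, standing in for a sparse document-term matrix:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression

# Tiny synthetic stand-in for a sparse document-term matrix
# (in the IMDB example, X is n_reviews x 10,000 word indicators)
rng = np.random.default_rng(0)
X = csr_matrix(rng.random((200, 50)) < 0.05, dtype=float)  # mostly zeros
beta = np.zeros(50)
beta[:5] = 3.0                                             # a few informative "words"
y = (X @ beta + rng.normal(size=200) > 0.3).astype(int)

# Lasso logistic regression; the liblinear solver handles sparse X directly
fit = LogisticRegression(penalty="l1", C=1.0, solver="liblinear").fit(X, y)
train_acc = fit.score(X, y)
```

Because `X` is stored in compressed sparse row format, the solver never materializes the zero entries, which is what makes the lasso fit fast on wide, sparse design matrices.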


Recurrent Neural Networks




Time Series Forecasting



Autocorrelation


  • The autocorrelation at lag \(\ell\) is the correlation of all pairs \((v_t,v_{t-\ell})\) that are \(\ell\) trading days apart

  • These sizable correlations give us confidence that past values will be helpful in predicting the future

  • This is a curious prediction problem: the response \(v_t\) is also a feature \(v_{t-\ell}\)
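The lag-\(\ell\) autocorrelation defined above is just the correlation of the series with a shifted copy of itself; a minimal sketch (the helper name `autocorr` is my own):

```python
import numpy as np

def autocorr(v, lag):
    # Correlation of all pairs (v_t, v_{t-lag}) that are `lag` steps apart
    v = np.asarray(v, dtype=float)
    return np.corrcoef(v[lag:], v[:-lag])[0, 1]

# A trending series (random walk) is strongly autocorrelated at short lags
v = np.cumsum(np.random.default_rng(0).normal(size=500))
r1 = autocorr(v, 1)
```

A perfectly linear series has lag-1 autocorrelation exactly 1, while white noise has autocorrelations near 0 at all lags.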


RNN Forecaster




Autoregression Forecaster



Summary of RNNs


When to Use Deep Learning


Fitting Neural Networks



Non Convex Functions and Gradient Descent




Tricks of the Trade


Dropout Learning



Ridge and Data Augmentation



Data Augmentation on the Fly



Double Descent




Some Facts